Remove duplicates from a Pandas DataFrame

05, July 2019

DataFrame.duplicated()

In Python's Pandas library, the DataFrame class provides a member function, DataFrame.duplicated(), to find duplicate rows based on either all columns or specific columns.

First, create a DataFrame with some duplicate rows:

In [18]:
import pandas as pd
students = [('Vishal', 34, 'Sydney'),
            ('Gaurav', 30, 'Delhi'),
            ('Madhura', 16, 'New York'),
            ('Vishakha', 30, 'Delhi'),
            ('Vishakha', 30, 'Delhi'),
            ('Vishakha', 30, 'Mumbai'),
            ('Madhura', 40, 'London'),
            ('Manavika', 30, 'Delhi')
           ]


# Create a DataFrame object
students_df = pd.DataFrame(students, columns=['Name', 'Age', 'City'])
students_df
Out[18]:
       Name  Age      City
0    Vishal   34    Sydney
1    Gaurav   30     Delhi
2   Madhura   16  New York
3  Vishakha   30     Delhi
4  Vishakha   30     Delhi
5  Vishakha   30    Mumbai
6   Madhura   40    London
7  Manavika   30     Delhi

Find Duplicate Rows based on all columns

To find & select the duplicates in all rows based on all columns call the Daraframe.duplicate() without any subset argument. It will return a Boolean series with True at the place of each duplicated rows except their first occurrence. Now pass this Boolean Series to List[] operator of Dataframe to select the rows which are duplicate i.e.

In [15]:
# Select duplicate rows except first occurrence based on all columns
duplicateRowsDF = students_df[students_df.duplicated()]

print("Duplicate Rows except first occurrence based on all columns are :")
print(duplicateRowsDF)
Duplicate Rows except first occurrence based on all columns are:
       Name  Age   City
4  Vishakha   30  Delhi

Here, all duplicate rows except their first occurrence are returned because the default value of the keep argument is 'first'.

If we want to select all duplicate rows except their last occurrence, we need to pass keep='last', i.e.

In [16]:
# Select duplicate rows except last occurrence based on all columns
duplicateRowsDF = students_df[students_df.duplicated(keep='last')]
 
print("Duplicate Rows except last occurrence based on all columns are :")
print(duplicateRowsDF)
Duplicate Rows except last occurrence based on all columns are:
       Name  Age   City
3  Vishakha   30  Delhi
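
If instead we want to mark every duplicated row, including its first occurrence, we can pass keep=False. This is a minimal sketch (the variable name is just illustrative); on the students_df above it should flag both row 3 and row 4:

# Select all duplicate rows, including first occurrences, based on all columns
allDuplicateRowsDF = students_df[students_df.duplicated(keep=False)]

print("All duplicate rows based on all columns are:")
print(allDuplicateRowsDF)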

Find Duplicate Rows based on selected columns

Compare rows & find duplicates based on the selected columns. For this, pass the list of column names in the subset argument: Dataframe.duplicate() function. It will select & return duplicate rows based on these passed columns only.

Find & select rows based on a single column:

In [17]:
duplicateRowsbyCol = students_df[students_df.duplicated(['Name'])]
 
print("Duplicate Rows based on a single column are:", duplicateRowsbyCol, sep='\n')
Duplicate Rows based on a single column are:
       Name  Age    City
4  Vishakha   30   Delhi
5  Vishakha   30  Mumbai
6   Madhura   40  London
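
Similarly, duplicates can be determined by a combination of columns by passing more than one column name in the subset list. A minimal sketch (variable name illustrative); on students_df this should flag rows 4 and 5, which repeat the Name/Age pair ('Vishakha', 30):

# Select duplicate rows based on the 'Name' and 'Age' columns only
duplicateRowsByCols = students_df[students_df.duplicated(subset=['Name', 'Age'])]

print("Duplicate Rows based on multiple columns are:", duplicateRowsByCols, sep='\n')

Finally, to actually remove the duplicates rather than just select them, we can either negate the Boolean Series with ~ to keep only the non-duplicate rows, or call DataFrame.drop_duplicates(), which accepts the same subset and keep arguments. A minimal sketch:

# Keep only the first occurrence of each row
# (equivalent to students_df[~students_df.duplicated()])
deduped_df = students_df.drop_duplicates()
print(deduped_df)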